[AutoDiff] Autodiff 9: Guard against LLVM worker-thread stack overflow from large per-task adstack budget #495
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a5d3009268
Opus review of test factorization: Looking at the four tests side by side:

Tests 2, 3, and 4 share identical setup/teardown/assertion structure and differ only in two dimensions: dynamic vs. static loop, and fixed vs. varying coefficient. These three could collapse into a single parametrized test:

```python
@pytest.mark.parametrize("use_static_loop", [True, False])
@pytest.mark.parametrize("use_varying_coeff", [True, False])
@test_utils.test(require=qd.extension.adstack, ad_stack_experimental_enabled=True)
def test_adstack_sum_linear(use_static_loop, use_varying_coeff):
    x = qd.field(qd.f32)
    y = qd.field(qd.f32)
    qd.root.dense(qd.i, 1).place(x, x.grad)
    qd.root.place(y, y.grad)

    @qd.kernel
    def compute():
        for i in x:
            v = x[i]
            if qd.static(use_static_loop):
                r = qd.static(range(3))
            else:
                r = range(3)
            for a in r:
                if qd.static(use_varying_coeff):
                    y[None] += v * qd.cast(a + 1, qd.f32)
                else:
                    y[None] += v

    x[0] = 1.0
    y[None] = 0.0
    compute()
    y.grad[None] = 1.0
    x.grad[0] = 0.0
    compute.grad()
    expected = sum((a + 1) for a in range(3)) if use_varying_coeff else 3.0
    assert x.grad[0] == test_utils.approx(expected, rel=1e-4)
```

This gives you 4 combinations (including static + varying coefficient, which is new free coverage) in one test, and eliminates ~50 lines of duplication.
Please could you address Opus's suggestions on test factorization.
```cpp
// The stack index `n` is clamped on read so that overflow (push past capacity) does not let subsequent pops and
// top-accesses underflow it and index far out of bounds. The corresponding stack_push sets
// `runtime->adstack_overflow_flag` and skips the increment instead of trapping, so the host-side launcher
// surfaces the failure as a Python exception rather than killing the process via __builtin_trap. When n == 0
```
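The clamp-on-read and overflow-flag behaviour quoted above can be modelled in a few lines. This is a minimal Python sketch, not the C++ runtime: the class name, the `capacity` parameter, and the Python-level `overflow_flag` attribute are illustrative stand-ins for the real `max_num_elements` and `runtime->adstack_overflow_flag`.

```python
class AdStack:
    """Toy model of the clamped adstack semantics described in the comment."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []
        self.n = 0
        self.overflow_flag = False

    def push(self, value):
        # Overflow sets the flag and skips the increment instead of trapping.
        if self.n + 1 > self.capacity:
            self.overflow_flag = True
            return
        self.data.append(value)
        self.n += 1

    def _clamped_n(self):
        # Reads clamp n so later pops / top-accesses cannot underflow it
        # and index far out of bounds.
        return max(min(self.n, self.capacity), 0)

    def pop(self):
        n = self._clamped_n()
        if n == 0:
            return None  # empty stack: nothing to pop
        self.n = n - 1
        return self.data.pop()

    def top(self):
        n = self._clamped_n()
        return self.data[n - 1] if n > 0 else None
```

A push past capacity raises the flag but leaves the stack contents intact, so subsequent pops still return the values that were successfully pushed.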
|
Asking Opus about any runtime impact:

Hot-path branches in stack operations. stack_push, stack_pop, and stack_top_primal are called on every iteration of every AD loop that uses the adstack. This PR adds a new bounds check in stack_push, `if (n + 1 > max_num_elements)`, before every push, plus an extra `LLVMRuntime *runtime` pointer argument threaded through every call site. In normal (non-overflowing) operation the branches are always not-taken, so CPU branch prediction should handle them well. On GPU, all threads in a warp would take the same path.

synchronize() poll. Every qd.sync() now calls check_adstack_overflow(), which makes a JIT call into the runtime to read and reset the flag, plus a result buffer fetch.

In practice, the cost is probably negligible — a couple of predictable branches per stack operation. But if someone wanted to benchmark it, the place to look is the stack push/pop overhead in a tight AD loop.
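If someone did want to benchmark it, a toy harness along these lines would do. Everything here is a Python stand-in for the C++ push (function names assumed, timings only indicative); the real branch costs a few cycles at most.

```python
import timeit

def push_unguarded(stack, n, cap, value):
    # Pre-PR behaviour: push with no bounds check.
    stack.append(value)
    return n + 1

def push_guarded(stack, n, cap, value):
    # Post-PR behaviour: the new bounds check; overflow skips the
    # increment (the real code also sets runtime->adstack_overflow_flag).
    if n + 1 > cap:
        return n
    stack.append(value)
    return n + 1

def run(push, iters=10_000):
    stack, n = [], 0
    for i in range(iters):
        n = push(stack, n, iters, i)
    return n

t_plain = timeit.timeit(lambda: run(push_unguarded), number=20)
t_guard = timeit.timeit(lambda: run(push_guarded), number=20)
```

In the non-overflowing regime both variants produce identical stacks; the delta between `t_guard` and `t_plain` is the (tiny, always-not-taken) branch cost this paragraph is talking about.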
|
Opus response for test coverage: Code changes in this PR:

Untested functionality implied by the design:

Please could you add tests for untested functionality implied by the design:

Oh, also please add tests for the rows in the table above with 'no', i.e.:
LGTM — all review issues addressed, including the CPU-only arch guard on the 256 KB adstack budget check.
Extended reasoning...
Overview
The PR adds adstack overflow detection to the LLVM autodiff runtime: stack_push now sets runtime->adstack_overflow_flag via a relaxed atomic store on overflow, stack_pop/stack_top_primal have underflow/clamping guards, LlvmRuntimeExecutor::check_adstack_overflow() polls the flag on every synchronize(), and LlvmProgramImpl::pre_finalize() suppresses the poll during teardown. Additionally AdStackAllocaStmt::size_in_bytes() is corrected from sizeof(int32) to sizeof(int64) to match the runtime layout. 11 files touched across codegen, IR, runtime, and tests.
Security risks
No security-sensitive surfaces (no auth, crypto, or permission code). The relaxed-atomic fix for the data race in stack_push is a correctness improvement.
Level of scrutiny
This is production-critical runtime code (autodiff correctness on all LLVM backends), but the changes are well-bounded safety additions rather than logic rewrites. Every bug raised across multiple review rounds was addressed:
- Comment inaccuracies (error_code vs adstack_overflow_flag) — fixed
- finalizing_ timing (pre_finalize() before teardown syncs) — fixed
- sizeof header mismatch — fixed
- Atomic write for multi-threaded overflow — fixed
- CPU-only arch guard on the 256 KB budget check — fixed (fixup commit 851d8fd)
- test_stack heap leak — fixed
- Test ad_stack_size pins and extension guards — fixed
Other factors
Test coverage is comprehensive: overflow raises, flag reset, teardown safety, multi-threaded overflow, and large-capacity resolution are all exercised. The bug hunting system found no new issues in the final state. The unresolved inline thread on the GPU budget check is moot because the arch guard is present in the submitted diff.
Opus says this PR is three ~independent fixes. Please could we split into three PRs? These three fixes are orthogonal and could be PR'd separately?

Yes — all three are independently mergeable, with one soft ordering preference. Sketch:

Fix 3: AdStackAllocaStmt::size_in_bytes header size
• Scope: 1 file, ~4 lines (quadrants/ir/statements.h).

Fix 1: runtime overflow → Python exception + teardown safety
• Scope: ~8 files (runtime.cpp, internal_functions.h, llvm_runtime_executor.{cpp,h}, llvm_program.h, program.cpp, program_impl.h, codegen_llvm.cpp for the stack_push signature).

Fix 2: codegen per-task stack-budget guard
• Scope: 2 files (codegen_llvm.cpp, codegen_llvm.h).

Suggested order

Why they're getting reviewed together anyway
They share a theme — "stop silent wrong gradients in the adstack path" — and they were probably discovered together while chasing the same symptom. That's normal.

But it sounds like we need the test from fix 1 for fix 3?

Yes — kind of. Let me be precise about what "needs" means here.

Fix 3 is provably correct by inspection
The runtime reads/writes the header through *(u64 *)stack at every site (stack_init, stack_push, stack_top_primal). The alloca was sized as if the header were int32. Changing it

But it has no executable validation in isolation
Without fix 1, the only test exercising the adstack at the C++ level is the old test_stack, which does 4 pushes into a max_size=16 stack — nowhere near the corrupted region.

That's why I said "obviously correct by inspection but with no executable proof" — and why fix 1's rewritten test_stack (which pushes to max_size and reads *(u64 *)stack) is the

What this means for splitting
Three options, in increasing test discipline:

If splitting, option 2 is the cleanest — a one-liner fix plus a one-line assertion is still a tiny PR, and it gives fix 3 its own regression pin instead of relying on fix 1's

The general principle: a fix that's "obviously right" but covered only by an unrelated test is one accidental refactor away from being silently undone. A 5-line dedicated test is
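For illustration, a hedged sketch of what fix 3's dedicated regression pin could look like, written as a Python model rather than the real C++ test. The layout formula (8-byte int64 header plus a primal and an adjoint slot per entry) is inferred from the PR description's own arithmetic (8 + 4096 * 16 bytes for an f64 stack); the function name merely mirrors `AdStackAllocaStmt::size_in_bytes()`.

```python
# The bug fix 3 repairs: the alloca was sized with a 4-byte (int32)
# header while the runtime reads/writes an 8-byte (int64) header.
HEADER_BYTES = 8  # sizeof(int64); was wrongly sizeof(int32)

def size_in_bytes(capacity, element_bytes):
    """Model of the corrected adstack alloca size:
    64-bit header + (primal + adjoint) slot per entry."""
    return HEADER_BYTES + capacity * 2 * element_bytes
```

A dedicated pin then asserts concrete sizes so an accidental revert to the int32 header (which would shrink every result by 4 bytes) fails loudly instead of silently corrupting the last entry.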
So let's split into the following PRs please:
Additional findings (outside current diff — PR may have been updated during review):
- 🔴 quadrants/runtime/llvm/runtime_module/runtime.cpp:1107-1112 — The PR deletes all eight shuffle_down runtime helpers (amdgpu_shuffle_down_i32/f32/i64/f64 and cuda_shuffle_down_i32/f32/i64/f64) from runtime.cpp as incidental cleanup, but codegen_cuda.cpp (~lines 746-752) and codegen_amdgpu.cpp (~lines 449-455) still emit JIT calls to these exact symbol names when lowering subgroup shuffle_down operations. Any CUDA or AMDGPU kernel that uses qd.subgroup.shuffle_down() will fail at JIT link time with an undefined symbol error. Restore the eight deleted functions or update the codegen sites to inline the shuffle_down logic.

Extended reasoning...
What the bug is and how it manifests
The PR diff shows two deletion hunks in runtime.cpp: one removing amdgpu_shuffle_down_i32/f32/i64/f64 (around line 1107) and one removing cuda_shuffle_down_i32/f32/i64/f64 (around line 1143). These are described as incidental cleanup but they are not dead code — they are JIT-linked runtime symbols that GPU codegen actively resolves at kernel compilation time.
The specific code path that triggers it
codegen_cuda.cpp contains an emit_cuda_shuffle_down() function (approximately lines 744–755) that calls:
```cpp
call("cuda_shuffle_down_i32", offset, value)
call("cuda_shuffle_down_f32", offset, value)
call("cuda_shuffle_down_f64", offset, value)
call("cuda_shuffle_down_i64", offset, value)
```

codegen_amdgpu.cpp contains an emit_amdgpu_shuffle_down() function (approximately lines 447–458) that calls:

```cpp
call("amdgpu_shuffle_down_i32", offset, value)
call("amdgpu_shuffle_down_f32", offset, value)
call("amdgpu_shuffle_down_f64", offset, value)
call("amdgpu_shuffle_down_i64", offset, value)
```

Neither codegen file is modified by this PR, so these eight call() sites remain unchanged and still reference the deleted symbols.
Why existing code does not prevent it
The call() invocations in LLVM codegen resolve symbol names from the JIT-linked runtime bitcode module at kernel compilation time. There is no compile-time check that the symbol exists in the module — the linker error only surfaces when a CUDA or AMDGPU kernel that exercises a subgroup.shuffle_down() operation is compiled. The PR's own test suite does not include CUDA/AMDGPU subgroup tests (the tests added are all CPU adstack tests), so the breakage escapes CI.
What the impact would be
Any user calling qd.subgroup.shuffle_down() on a CUDA or AMDGPU backend will receive a JIT linker error ('undefined symbol: cuda_shuffle_down_i32' or equivalent) the first time the kernel is compiled, turning previously working code into a hard failure. This is a complete regression for GPU subgroup reduction patterns.
How to fix it
Either (a) restore the eight deleted functions — they were not related to the adstack work and their deletion was purely cosmetic — or (b) update codegen_cuda.cpp and codegen_amdgpu.cpp to inline the shuffle_down logic directly instead of calling the runtime helpers.
Step-by-step proof
- PR diff, runtime.cpp hunk 1 (line ~1107): deletes amdgpu_shuffle_down_i32, amdgpu_shuffle_down_f32, amdgpu_shuffle_down_i64, amdgpu_shuffle_down_f64.
- PR diff, runtime.cpp hunk 2 (line ~1143): deletes cuda_shuffle_down_i32, cuda_shuffle_down_f32, cuda_shuffle_down_i64, cuda_shuffle_down_f64.
- grep for 'shuffle_down' in the post-PR runtime.cpp returns zero matches — the functions are gone.
- codegen_cuda.cpp (not modified by PR): emit_cuda_shuffle_down() at lines 746/748/750/752 calls call("cuda_shuffle_down_i32", ...), call("cuda_shuffle_down_f32", ...), call("cuda_shuffle_down_f64", ...), call("cuda_shuffle_down_i64", ...).
- codegen_amdgpu.cpp (not modified by PR): emit_amdgpu_shuffle_down() at lines 449/451/453/455 calls the amdgpu variants.
- When any CUDA/AMDGPU kernel containing a subgroup.shuffle_down is compiled, the JIT linker cannot resolve the missing symbols and raises an undefined symbol error.
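One way to pin this class of regression without GPU hardware in CI is a source-level check that the JIT-linked symbol names still exist in runtime.cpp. A hedged sketch — the helper and the check itself are illustrative, not part of the PR:

```python
import re

# The eight symbol names the CUDA/AMDGPU codegen resolves by name
# at JIT link time (per the call() sites quoted above).
REQUIRED_SYMBOLS = [
    f"{arch}_shuffle_down_{ty}"
    for arch in ("cuda", "amdgpu")
    for ty in ("i32", "f32", "i64", "f64")
]

def missing_symbols(runtime_source: str):
    """Return every required symbol with no definition-like occurrence
    (name followed by an opening paren) in the given source text."""
    return [
        s for s in REQUIRED_SYMBOLS
        if not re.search(rf"\b{s}\s*\(", runtime_source)
    ]
```

A test would read the real runtime.cpp and assert `missing_symbols(...) == []`, turning a silent link-time breakage into a CI failure at the moment of deletion.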
LGTM — focused CPU-only budget guard with correct arch_is_cpu gate, properly reset per task, and a well-constructed child-process test.
Extended reasoning...
Overview
Touches three files: codegen_llvm.{cpp,h} and test_adstack.py. The change adds a single new field (ad_stack_fn_scope_bytes_) and a per-task accumulator reset in init_offloaded_task_function, then enforces a 256 KB budget via QD_ERROR_IF inside visit(AdStackAllocaStmt*). The test runs the budget-crossing kernel in a subprocess and asserts a non-zero exit code plus the expected guard message in output.
Security risks
No security-sensitive paths are touched. The change adds a compile-time guard that aborts the process on overflow — the same category of safety-check that already exists elsewhere in the codebase. No auth, crypto, or permissions code is involved.
Level of scrutiny
Low-to-moderate. The diff is small and self-contained. The two most important correctness properties — (a) the arch_is_cpu(current_arch()) gate preventing false rejections of valid GPU kernels, and (b) using current_task->name rather than the kernel-level kernel_name in the error message — are both correctly implemented in the current diff. The field-level comment in the header is already properly scoped with "On CPU arches only".
Other factors
All issues surfaced in the review thread (comment inaccuracies, teardown safety, flag-reset, ad_stack_size header mismatch, GPU guard) were addressed in earlier PRs or earlier commits of this PR. The one inline comment from today that lacks a formal "resolved" marker (codegen_llvm.h field comment) reflects code that is already correct in the current diff. The test is gated on both qd.extension.adstack and qd.extension.data64, and the subprocess approach correctly handles the QD_ERROR_IF → std::terminate termination path.
LGTM — the CPU-only budget guard is correctly implemented and the inline comments I raised today were false positives (the code already used current_task->name and the header comment was already CPU-scoped).
Extended reasoning...
Overview
PR #495 adds a CPU-only per-task adstack budget guard to TaskCodeGenLLVM::visit(AdStackAllocaStmt*). The change touches three files: codegen_llvm.cpp (guard logic + per-task accumulator reset), codegen_llvm.h (new ad_stack_fn_scope_bytes_ field), and tests/python/test_adstack.py (child-process test that verifies the guard fires). This is the focused 'codegen budget guard' slice of the autodiff safety series, matching the split that was requested.
Security risks
No security-sensitive code paths touched. The guard is a compile-time check that produces a loud abort rather than silent memory corruption; it cannot be reached from user-supplied data at runtime, only from kernel compilation. No auth, crypto, or permission logic is involved.
Level of scrutiny
Moderate. The guard is CPU-only (correctly gated on arch_is_cpu(current_arch()) in the .cpp; the header comment already notes the gate and the GPU-local-memory distinction), uses current_task->name for per-task identification in the error message, and resets the accumulator at the start of each offloaded task. Earlier rounds of review caught and fixed a real ordering bug (pre_finalize timing), a header-size mismatch (sizeof(int32) vs sizeof(int64)), a missing arch guard, and several comment inaccuracies — all addressed by the author. No bugs were found by the automated system on the current revision.
Other factors
My two inline comments posted today (2026-04-21 07:35 and 08:48) were false positives: the diff already uses current_task->name and the header already carries the 'On CPU arches only' scoping language. The subprocess.run timeout nit in the budget-guard test is a minor style concern; the teardown test's explicit comment explains the team's rationale for relying on pytest's per-test timeout instead, which applies equally here.
Guard against LLVM worker-thread stack overflow from large per-task adstack budget
TL;DR
The check is CPU-only by design: on CUDA / AMDGPU the same LLVM allocas lower to per-thread GPU local memory (a separate address space sized by the driver, not shared with the CPU call stack), so the 256 KB CPU-stack budget is not meaningful there. A non-gated version of the check would falsely reject valid GPU kernels with f64 loop-carried variables (4 adstacks at `ad_stack_size=4096` already cross 256 KB).

Why 256 KB
macOS secondary threads default to a ~512 KB stack. The worker-thread pool used by the LLVM JIT runs on those. A function-scope alloca for an adstack sits on the LLVM stack frame; if the sum of its sizes across a task crosses the thread's stack limit, the frame corrupts adjacent stack memory (typically the next page, sometimes guard pages). Downstream reverse-mode accumulators read zero, producing silently-wrong gradients with no crash.
256 KB is a conservative upper bound that leaves ~256 KB for other locals and nested call frames. Linux defaults to ~8 MB per secondary thread, so the limit is strictly conservative there — the codegen is more protective on Linux than strictly necessary, which is fine.
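The arithmetic behind these limits can be sanity-checked directly. A small sketch, assuming the layout stated elsewhere in this description (8-byte header, 16 bytes per f64 entry for primal plus adjoint):

```python
BUDGET_BYTES = 256 * 1024          # the per-task CPU budget this PR enforces
MACOS_THREAD_STACK = 512 * 1024    # default macOS secondary-thread stack

def adstack_bytes(capacity, elem_bytes=8):
    # 8-byte header + (primal + adjoint) slot per entry
    return 8 + capacity * 2 * elem_bytes

one_f64_stack = adstack_bytes(4096)  # 65,544 bytes, just over 64 KB
```

Four such stacks total 262,176 bytes, 32 bytes past the 262,144-byte budget, which is exactly why a non-gated check would reject valid GPU kernels at 4 adstacks; the budget is half the macOS secondary-thread stack, leaving ~256 KB of headroom for other locals and nested call frames.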
Why QD_ERROR_IF and not throw
`QD_ERROR_IF` logs a descriptive message and then calls `QD_UNREACHABLE` (a compiler-assisted abort). The codegen runs inside the LLVM compilation worker thread pool, where a C++ exception thrown from a worker doesn't cleanly propagate back to the Python-level caller (pybind11's exception translation only works at pybind binding boundaries, which are on the main thread). Throwing `QuadrantsRuntimeError` from here results in `std::terminate()` and a bare SIGABRT — a worse user experience than the guard's current behaviour, which at least logs the message before aborting.

The trade-off is that the abort shows up as a SIGABRT in the child process rather than a catchable Python exception. The testing approach below handles that.
Obsoleted by subsequent PRs
Autodiff 12 (LLVM heap-backed adstack) moves the storage off the worker thread stack entirely, making this guard unnecessary. When that PR lands, the guard — along with the `ad_stack_fn_scope_bytes_` accumulator field — is removed. This PR is still useful in the interim to make the silent-corruption failure loud, and to provide coverage for any future path that might re-introduce function-scope allocas for adstack storage.

Changes
quadrants/codegen/llvm/codegen_llvm.{h,cpp}

- `TaskCodeGenLLVM` gains an `ad_stack_fn_scope_bytes_` accumulator field, reset to 0 at the top of each offloaded task.
- `init_offloaded_task_function` resets the accumulator.
- `visit(AdStackAllocaStmt)` tallies per-alloca size and raises via `QD_ERROR_IF` when the per-task total crosses 256 KB on CPU.

Tests
test_adstack_codegen_budget_guard_runs_in_child_process

Runs the overflowing kernel in a child process (since `QD_ERROR_IF` aborts the process rather than raising a catchable Python exception) and asserts that the child exits with a non-zero code and that the guard message ("autodiff-stack budget exceeded") appears in the child's combined stdout/stderr.

Skip-gated on `qd.cpu` having the adstack and f64 extensions. Kernel shape: five `qd.f64` loop-carried variables applying `qd.sin` inside a dynamic `range(n_iter[None])` at `ad_stack_size=4096`. Each adstack is 8 + 4096 * 16 = 65,544 bytes ≈ 64 KB; five × 64 KB = 320 KB, comfortably past the 256 KB guard.

The "dynamic `n_iter` from a field" shape is load-bearing: with a Python-literal `range(3)` the compile-time trip count is known and the determine-ad-stack-size pass sizes each stack to 3 slots, not `default_ad_stack_size=4096`. Only a runtime-bound range defeats that pass and leaves each stack at the full `default_ad_stack_size`.

Side-effect audit

- The guard is gated on `arch_is_cpu(current_arch())`; it never fires on GPU.

Stack

Autodiff 9 of 13. Third commit of the "LLVM adstack safety" triplet split, and the top-most of that split (this PR is #495). Based on #535 (runtime overflow). Followed by #490 (SPIR-V adstack).
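The child-process pattern that test relies on can be sketched independently of qd itself. The helper name and the timeout value below are illustrative, not taken from the PR:

```python
import subprocess
import sys

def run_in_child(script: str):
    """Run a Python script in a child interpreter and return
    (exit code, combined stdout+stderr). Needed because QD_ERROR_IF
    aborts the process instead of raising a catchable exception."""
    proc = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True, text=True, timeout=120,
    )
    return proc.returncode, proc.stdout + proc.stderr

# A real test would pass the overflowing qd kernel source as `script`
# and assert a non-zero exit code plus the guard message:
#   rc, out = run_in_child(overflowing_kernel_source)
#   assert rc != 0 and "autodiff-stack budget exceeded" in out
```

Capturing stdout and stderr together mirrors the test's "combined stdout/stderr" assertion, since the abort path may log through either stream.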